Proteomics, Post-translational Modifications, andIntegrative Analyses Reveal MolecularHeterogeneity within Medulloblastoma Subgroups¶
PROTEOMICS DATASET RE-ANALYSIS¶
Abstract¶
There is a pressing need to identify therapeutic targets in tumors with low mutation rates such as the malignant pediatric brain tumor medulloblastoma. To address this challenge, we quantitatively profiled global proteomes and phospho-proteomes of 45 medulloblastoma samples. Integrated analyses revealed that tumors with similar RNA expression vary extensively at the post-transcriptional and post-translational levels. We identified distinct pathways associated with two subsets of SHH tumors, and found post-translational modifications of MYC that are associated with poor outcomes in Group 3 tumors. We found kinases associated with subtypes and showed that inhibiting PRKDC sensitizes MYC-driven cells to radiation. Our study shows that proteomics enables a more comprehensive, functional readout, providing a foundation for future therapeutic strategies.
|experimental\_design| Figure 1: Summary of data types included in this study, depth of proteomic data types, and cohort composition.
Consensus clustering, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE) carried out separately on each data type predominantly revealed the known subgroups: Group 3, Group 4, and SHH (Figures (Figures2A,2A, S1-S3). The proteomic data also identified very stable subsets of two known subgroups. Here, we refer to Group 3 clusters as Group 3a (G3a) and Group 3b (G3b), and SHH clusters as SHHa and SHHb.
Figure 2: Comparison of clustering results.
Objectives¶
Use CKG’s analytics core to:
Re-analyze the proteomics dataset:
Preprocessing
Stratification looking at covariates: Age, Gender, Metastatic status and histology
Differential regulation including covariates
Functional enrichment
Identify patients subgroups
Integration of all omics datasets using Similarity Network Fusion (REF: https://www.nature.com/articles/nmeth.2810).
Idenfity clinical and molecular differences between the identifierd subgroups
Investigate correlating PTMs on relevant protein complexes
Loading the Data¶
In this analysis we use:
Clinical metadata
RNA-sequencing
Proteomics
Phosphoproteomics
Phosphotyrosine proteomics
Acetylomics
This datasets have been previously formated (samples x proteins) and standardized to UniProt identifiers (data available in /assets) (see notebook Meduloblastoma Data Processing.ipynb).
[2]:
import pandas as pd
import numpy as np
from ckg.analytics_core.analytics import analytics
from ckg.analytics_core.viz import viz
from ckg.graphdb_builder import mapping
from ckg.graphdb_connector import connector
from plotly.offline import init_notebook_mode, iplot
%matplotlib inline
init_notebook_mode(connected=True)
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package outdated is out of date. Your version is 0.2.0, the latest is 0.2.1.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
**kwargs
C:\Users\sande\.conda\envs\ckgenv\lib\site-packages\outdated\utils.py:18: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.3.10, the latest is 0.3.11.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
**kwargs
WGCNA functions will not work. Module Rpy2 not installed.
R functions will not work. Module Rpy2 not installed.
[3]:
clinical_data = pd.read_excel('../../assets/samples.xlsx', header=0)
rnaseq_data = pd.read_csv("../../assets/rnaseq.tsv", sep='\t', header=0)
proteomics_data = pd.read_csv("../../assets/proteomics.tsv", sep='\t', header=0)
phosphoSTY_data = pd.read_csv("../../assets/phosphoSTY.tsv", sep='\t', header=0)
phosphoY_data = pd.read_csv("../../assets/phosphoY.tsv", sep='\t', header=0)
acetylomics_data = pd.read_csv("../../assets/acetylomics.tsv", sep='\t', header=0)
1. Re-analyze the proteomics dataset¶
[4]:
proteomics_data.head()
[4]:
index | Q6MZP0 | Q8NF91 | Q8WXH0 | Q15149 | Q15149.1 | Q15149.2 | Q15149.3 | P58107 | B4DTV0 | ... | A0A0A0MR82 | Q9HD23 | P78325 | O43763 | P28749 | A0A024QZT7 | Q96MC2 | W5XKT8 | Q5VZ18 | group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MB018 | 1.068076 | -0.696709 | 0.931591 | 2.420232 | 2.420232 | 2.436102 | 2.429754 | 1.690194 | 0.122202 | ... | NaN | 0.620531 | NaN | -2.797946 | NaN | NaN | NaN | -0.290428 | 1.607669 | GR3 |
1 | MB095 | -2.623296 | 1.433604 | -2.755208 | -0.515201 | -0.522668 | -0.515201 | -0.522668 | -0.398223 | -0.983114 | ... | NaN | NaN | -2.658141 | NaN | NaN | NaN | -1.707383 | NaN | NaN | GR3 |
2 | MB106 | -0.923848 | -1.281532 | -1.003560 | -1.279489 | -1.304015 | -1.304015 | -1.301972 | 0.318850 | -2.362762 | ... | NaN | -0.443529 | NaN | 0.339289 | NaN | NaN | NaN | -0.504846 | -0.686754 | GR3 |
3 | MB170 | -2.355968 | 1.185584 | -2.055772 | 0.250797 | 0.252697 | 0.241297 | 0.246997 | 0.294496 | 0.203297 | ... | NaN | NaN | -4.029845 | NaN | NaN | NaN | -0.075999 | NaN | NaN | GR3 |
4 | MB226 | 0.077085 | -3.470951 | -0.535310 | -1.351123 | -1.353264 | -1.359688 | -1.359688 | -1.571671 | -2.201195 | ... | NaN | -0.668067 | NaN | 0.246243 | 1.181965 | NaN | NaN | NaN | NaN | GR3 |
5 rows × 13127 columns
Preprocessing:¶
Remove WNT samples (low number of samples)
Filter proteins with more than 80% missing values in all groups
Normalize data using median-centered normalization
Impute missing values using a mixed model: K-Nearest Neighbors algorithm (KNN) if at least 60% valid values, otherwise Minimum Probability (MinProb) drawing random values from a down-shifted Gaussian distribution
Include Protein name in columns using CKG’s mapping functionality
Include relevant clinical features: Age, Gender, Metastatic status, histology
[5]:
proteomics_data = proteomics_data[proteomics_data['group'] != 'WNT']
[6]:
valid_proteins = analytics.extract_percentage_missing(proteomics_data, missing_max=0.2, drop_cols=['index'], group='group', how='all')
proteomics_data = proteomics_data[valid_proteins + ['index', 'group']]
[7]:
proteomics_data = analytics.normalize_data(proteomics_data, method='median', normalize='samples')
[8]:
proteomics_data = analytics.imputation_mixed_norm_KNN(proteomics_data, index_cols=['index','group']).reset_index()
proteomics_data = proteomics_data.rename(columns={'index':'sample'})
[9]:
m = mapping.getMappingFromDatabase(proteomics_data.columns, node="Protein", attribute_from='id', attribute_to='name')
columns = [m[c]+"~"+c if c in m and m[c] is not None else c for c in proteomics_data.columns]
proteomics_data.columns = columns
[10]:
proteomics_data.head()
[10]:
sample | group | FLJ10292~A0A023T6R1 | RBM8~A0A023T787 | A0A024QYR3 | TM9SF2~A0A024QYR8 | A0A024QYS2 | AP1S1~A0A024QYT6 | EIF3S8~A0A024QYU9 | RABL5~A0A024QYV3 | ... | PMF1-BGLAP~U3KQ54 | CSNK1G1~U3KQB3 | LSM4~U3KQK1 | HEL-76~V9HW21 | MPND~W4VSR2 | YP_003024026 | YP_003024028 | YP_003024029 | YP_003024030 | YP_003024036 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MB018 | GR3 | 0.305789 | 0.121693 | -0.306807 | -1.420907 | -1.992240 | 0.080430 | -0.586125 | -0.170322 | ... | 1.010434 | -0.398855 | 0.362923 | 2.222930 | 0.454971 | 0.169304 | -3.687196 | -1.779577 | 0.673982 | 1.286578 |
1 | MB095 | GR3 | 0.161486 | 0.004686 | -0.117270 | 0.385487 | 0.106731 | -0.577716 | 0.938022 | 0.397931 | ... | 0.440243 | 1.082378 | 0.149042 | -0.338782 | 1.171978 | 1.405934 | 0.320776 | 1.109756 | -0.313893 | 1.475623 |
2 | MB106 | GR3 | 0.246131 | 0.722363 | 0.252263 | 0.248175 | 0.037652 | 1.270131 | 0.865436 | 0.188902 | ... | 1.186331 | -0.450842 | -0.465150 | -2.932151 | -0.575521 | 0.571113 | -0.064543 | 0.368766 | -1.517764 | 1.302834 |
3 | MB170 | GR3 | 0.158128 | 0.141028 | -1.245953 | -1.109155 | -1.055955 | -1.152854 | 0.448824 | -0.016670 | ... | 0.942817 | -0.839358 | 0.255027 | 1.818705 | 0.072629 | -0.073669 | -0.537262 | -1.215553 | 0.564722 | -0.225667 |
4 | MB226 | GR3 | 0.829814 | 0.617831 | -0.223676 | -0.799670 | -0.551286 | -0.803952 | 0.647809 | -0.435659 | ... | 1.281616 | 0.088945 | -0.294337 | 0.016143 | -0.225818 | -0.465636 | -0.119610 | -1.510562 | 0.589995 | -0.559851 |
5 rows × 11137 columns
[11]:
proteomics_data = proteomics_data.set_index(['sample', 'group']).join(clinical_data.set_index(['sample', 'group'])).reset_index()
proteomics_data = proteomics_data.drop('Proteome-based ', axis=1)
proteomics_data.loc[proteomics_data['Metastatic Status'].isna(), 'Metastatic Status'] = 'M3'
[12]:
proteomics_data.head()
[12]:
sample | group | FLJ10292~A0A023T6R1 | RBM8~A0A023T787 | A0A024QYR3 | TM9SF2~A0A024QYR8 | A0A024QYS2 | AP1S1~A0A024QYT6 | EIF3S8~A0A024QYU9 | RABL5~A0A024QYV3 | ... | MPND~W4VSR2 | YP_003024026 | YP_003024028 | YP_003024029 | YP_003024030 | YP_003024036 | Age At Diagnosis (years) | Gender | Metastatic Status | Histology | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MB018 | GR3 | 0.305789 | 0.121693 | -0.306807 | -1.420907 | -1.992240 | 0.080430 | -0.586125 | -0.170322 | ... | 0.454971 | 0.169304 | -3.687196 | -1.779577 | 0.673982 | 1.286578 | 13.0 | F | M0 | LCA |
1 | MB095 | GR3 | 0.161486 | 0.004686 | -0.117270 | 0.385487 | 0.106731 | -0.577716 | 0.938022 | 0.397931 | ... | 1.171978 | 1.405934 | 0.320776 | 1.109756 | -0.313893 | 1.475623 | 3.0 | M | M0 | classic |
2 | MB106 | GR3 | 0.246131 | 0.722363 | 0.252263 | 0.248175 | 0.037652 | 1.270131 | 0.865436 | 0.188902 | ... | -0.575521 | 0.571113 | -0.064543 | 0.368766 | -1.517764 | 1.302834 | 3.0 | M | M3 | LCA |
3 | MB170 | GR3 | 0.158128 | 0.141028 | -1.245953 | -1.109155 | -1.055955 | -1.152854 | 0.448824 | -0.016670 | ... | 0.072629 | -0.073669 | -0.537262 | -1.215553 | 0.564722 | -0.225667 | 16.0 | M | M3 | LCA |
4 | MB226 | GR3 | 0.829814 | 0.617831 | -0.223676 | -0.799670 | -0.551286 | -0.803952 | 0.647809 | -0.435659 | ... | -0.225818 | -0.465636 | -0.119610 | -1.510562 | 0.589995 | -0.559851 | 5.0 | M | M3 | classic |
5 rows × 11141 columns
Sample Stratification¶
We use Principal Component Analysis to visualize whether the groups cluster together.
[13]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Gender', 'Metastatic Status', 'Age At Diagnosis (years)', 'Histology'], group='group', annotation_cols=['sample'], components=2)
[14]:
args.update({'title':"PCA Medulloblastoma groups", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_groups", plot_format='png', directory='.')
We can also color based on possible covariates: Age, Gender, Metastatic status and histology
[15]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group', 'Metastatic Status', 'group', 'Gender', 'Histology'], group='Age At Diagnosis (years)', annotation_cols=['sample'], components=2)
[16]:
args.update({'title':"PCA Medulloblastoma by Age", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_age", plot_format='png', directory='.')
[17]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Age At Diagnosis (years)', 'group', 'Metastatic Status', 'Histology'], group='Gender', annotation_cols=['sample'], components=2)
[ ]:
[18]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group','Age At Diagnosis (years)' , 'Gender', 'Histology'], group='Metastatic Status', annotation_cols=['sample'], components=2)
[19]:
args.update({'title':"PCA Medulloblastoma by Metastatic Status", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_metastatics", plot_format='png', directory='.')
[20]:
pca, args = analytics.run_pca(proteomics_data, drop_cols=['Proteome-based', 'Gender', 'group', 'Metastatic Status', 'group', 'Gender', 'Age At Diagnosis (years)'], group='Histology', annotation_cols=['sample'], components=2)
[21]:
args.update({'title':"PCA Medulloblastoma by Histology", 'loadings':10, 'factor': 400})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="PCA_Medulloblastoma_histology", plot_format='png', directory='.')
We can also stratify looking at biological processes (GO terms) common to the studied cohort. We use single-sample GSEA (REF:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2783335/) to project each sample onto a 2D space of normalized gene set enrichment scores.
We extract the Gene Ontology annotation from CKG’s graph database using the connector module.
[22]:
go_terms_query = "MATCH (p:Protein)-[]-(bp:Biological_process) WHERE (p.name+'~'+p.id) IN {} RETURN DISTINCT (p.name+'~'+p.id) AS identifier,bp.name AS annotation"
go_terms_query = go_terms_query.format(proteomics_data.columns.tolist())
annotation = connector.run_query(go_terms_query)
[23]:
annotation.head()
[23]:
annotation | identifier | |
---|---|---|
0 | mitochondrial genome maintenance | AKT3~Q9Y243 |
1 | mitochondrial genome maintenance | MGME1~Q9BQP7 |
2 | mitochondrial genome maintenance | SLC25A33~Q9BSK2 |
3 | mitochondrial genome maintenance | LONP1~P36776 |
4 | mitochondrial genome maintenance | OPA1~O60313 |
[24]:
ssgsea_data = analytics.run_ssgsea(data=proteomics_data.drop(['Metastatic Status', 'Gender', 'Histology', 'Age At Diagnosis (years)'], axis=1), annotation=annotation, annotation_col='annotation',
identifier_col='identifier', set_index=['group', 'sample'], outdir=None,
min_size=10, scale=False, permutations=0)
[25]:
pca, args = analytics.run_pca(ssgsea_data['nes'], drop_cols=[ ], group='group', annotation_cols=['sample'], components=2)
[26]:
args.update({'title':"Functional PCA Medulloblastoma groups", 'loadings':25, 'factor': 15})
figure = viz.get_pca_plot(pca, identifier='pca_plot', args=args)
iplot(figure.figure)
viz.save_DASH_plot(figure, name="Functional_PCA_Medulloblastoma_groups", plot_format='png', directory='.')
Differential Regulation¶
We perform ANCOVA analysis including covariates: Age, Gender, Metastatic status and Histology.
[27]:
pro_anova_result = analytics.run_ancova(proteomics_data, covariates=['Age At Diagnosis (years)', 'Gender', 'Histology', 'Metastatic Status'], drop_cols=['sample'], subject=None, permutations=0)
[28]:
pro_anova_result.shape
[28]:
(33405, 23)
[29]:
anova_significant = pro_anova_result[pro_anova_result.rejected & (pro_anova_result['posthoc padj'] <= 0.01)]
[30]:
anova_significant['identifier'].unique().shape
[30]:
(2590,)
[31]:
volcano_plot = viz.run_volcano(pro_anova_result, identifier='proteomics_volcanos',
args={'alpha':0.01, 'fc':2, 'num_annotations': 80,
'colorscale':'Blues', 'showscale': False,
'marker_size':8, 'x_title':'log2FC', 'y_title':'-log10(pvalue)'})
i = 0
for plot in volcano_plot:
iplot(plot.figure)
viz.save_DASH_plot(plot, name="volcano_plot_"+str(i), plot_format='png', directory='.')
i += 1
Functional Enrichment Analysis¶
[32]:
enrichment_results = analytics.run_up_down_regulation_enrichment(pro_anova_result, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh', alpha=0.01, lfc_cutoff=1)
[ ]:
[33]:
figures = viz.get_enrichment_plots(enrichment_results, identifier='enrichment', args={'width':2200})
i = 0
for fig in figures:
iplot(fig.figure)
viz.save_DASH_plot(fig, name="enrichment_"+str(i), plot_format='png', directory='.')
i += 1